When it comes to the world of businesses, having good reviews equals good economical flow in most cases, but just how influential can reviews really be? For our project, we are looking at the effects of positive and negative reviews on businesses using Yelp’s review dataset. With that in mind, we also want to show how successful a business can get based off of initial reviews.
For this project, the main coding language will be Python. Since we are using the over 8 million reviews as a baseline to solve our problem, we will be using Spark so we can be flexible in the ways we download/upload the data from yelp to our iPynb.
!pip install "notebook>=5.3" "ipywidgets>=7.5"
!pip install plotly==5.3.1
#!pip install pyspark
!pip install sklearn
Requirement already satisfied: notebook>=5.3 in /opt/conda/lib/python3.9/site-packages (6.4.4) Requirement already satisfied: ipywidgets>=7.5 in /opt/conda/lib/python3.9/site-packages (7.6.5) Requirement already satisfied: jupyter-client>=5.3.4 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (7.0.6) Requirement already satisfied: nbformat in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (5.1.3) Requirement already satisfied: nbconvert in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (6.2.0) Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (0.11.0) Requirement already satisfied: argon2-cffi in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (21.1.0) Requirement already satisfied: Send2Trash>=1.5.0 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (1.8.0) Requirement already satisfied: pyzmq>=17 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (22.3.0) Requirement already satisfied: terminado>=0.8.3 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (0.12.1) Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (0.2.0) Requirement already satisfied: ipykernel in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (6.4.1) Requirement already satisfied: jupyter-core>=4.6.1 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (4.8.1) Requirement already satisfied: traitlets>=4.2.1 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (5.1.0) Requirement already satisfied: tornado>=6.1 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (6.1) Requirement already satisfied: jinja2 in /opt/conda/lib/python3.9/site-packages (from notebook>=5.3) (3.0.2) Requirement already satisfied: widgetsnbextension~=3.5.0 in /opt/conda/lib/python3.9/site-packages (from ipywidgets>=7.5) (3.5.1) Requirement already satisfied: ipython>=4.0.0 in /opt/conda/lib/python3.9/site-packages (from ipywidgets>=7.5) (7.28.0) Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /opt/conda/lib/python3.9/site-packages (from ipywidgets>=7.5) (1.0.2) Requirement already satisfied: matplotlib-inline<0.2.0,>=0.1.0 in /opt/conda/lib/python3.9/site-packages (from ipykernel->notebook>=5.3) (0.1.3) Requirement already satisfied: debugpy<2.0,>=1.0.0 in /opt/conda/lib/python3.9/site-packages (from ipykernel->notebook>=5.3) (1.4.1) Requirement already satisfied: pickleshare in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (0.7.5) Requirement already satisfied: pexpect>4.3 in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (4.8.0) Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (58.2.0) Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (3.0.20) Requirement already satisfied: decorator in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (5.1.0) Requirement already satisfied: jedi>=0.16 in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (0.18.0) Requirement already satisfied: backcall in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (0.2.0) Requirement already satisfied: pygments in /opt/conda/lib/python3.9/site-packages (from ipython>=4.0.0->ipywidgets>=7.5) (2.10.0) Requirement already satisfied: entrypoints in /opt/conda/lib/python3.9/site-packages (from jupyter-client>=5.3.4->notebook>=5.3) (0.3) Requirement already satisfied: nest-asyncio>=1.5 in /opt/conda/lib/python3.9/site-packages (from jupyter-client>=5.3.4->notebook>=5.3) (1.5.1) Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.9/site-packages (from jupyter-client>=5.3.4->notebook>=5.3) (2.8.2) Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.9/site-packages (from nbformat->notebook>=5.3) (4.1.0) Requirement already satisfied: ptyprocess in /opt/conda/lib/python3.9/site-packages (from terminado>=0.8.3->notebook>=5.3) (0.7.0) Requirement already satisfied: cffi>=1.0.0 in /opt/conda/lib/python3.9/site-packages (from argon2-cffi->notebook>=5.3) (1.14.6) Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.9/site-packages (from jinja2->notebook>=5.3) (2.0.1) Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (0.8.4) Requirement already satisfied: bleach in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (4.1.0) Requirement already satisfied: jupyterlab-pygments in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (0.1.2) Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (0.5.4) Requirement already satisfied: testpath in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (0.5.0) Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (1.5.0) Requirement already satisfied: defusedxml in /opt/conda/lib/python3.9/site-packages (from nbconvert->notebook>=5.3) (0.7.1) Requirement already satisfied: pycparser in /opt/conda/lib/python3.9/site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=5.3) (2.20) Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/conda/lib/python3.9/site-packages (from jedi>=0.16->ipython>=4.0.0->ipywidgets>=7.5) (0.8.2) Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /opt/conda/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.3) (0.17.3) Requirement already satisfied: attrs>=17.4.0 in /opt/conda/lib/python3.9/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat->notebook>=5.3) (21.2.0) Requirement already satisfied: wcwidth in /opt/conda/lib/python3.9/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0->ipywidgets>=7.5) (0.2.5) Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.1->jupyter-client>=5.3.4->notebook>=5.3) (1.16.0) Requirement already satisfied: webencodings in /opt/conda/lib/python3.9/site-packages (from bleach->nbconvert->notebook>=5.3) (0.5.1) Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from bleach->nbconvert->notebook>=5.3) (21.0) Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.9/site-packages (from packaging->bleach->nbconvert->notebook>=5.3) (2.4.7) Requirement already satisfied: plotly==5.3.1 in /opt/conda/lib/python3.9/site-packages (5.3.1) Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.9/site-packages (from plotly==5.3.1) (8.0.1) Requirement already satisfied: six in /opt/conda/lib/python3.9/site-packages (from plotly==5.3.1) (1.16.0) Requirement already satisfied: sklearn in /opt/conda/lib/python3.9/site-packages (0.0) Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.9/site-packages (from sklearn) (1.0) Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->sklearn) (3.0.0) Requirement already satisfied: numpy>=1.14.6 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->sklearn) (1.20.3) Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->sklearn) (1.7.1) Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.9/site-packages (from scikit-learn->sklearn) (1.1.0)
import json
import random
import numpy as np
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
import plotly.express as px
from sklearn.linear_model import LinearRegression
px.set_mapbox_access_token("pk.eyJ1IjoicmV4dGVyZHMiLCJhIjoiY2t2dWgwazZ2M3E2YTJucWd3aDY1Yzd3dyJ9.r-rgjBmvcCMR1tBu8imGUQ")
sc = SparkContext()
ss = SparkSession(sc)
WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 21/12/16 00:30:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
business = ss.read.json("data/yelp_academic_dataset_business.json")
checkin = ss.read.json("data/yelp_academic_dataset_checkin.json")
review = ss.read.json("data/yelp_academic_dataset_review.json")
tip = ss.read.json("data/yelp_academic_dataset_tip.json")
user = ss.read.json("data/yelp_academic_dataset_user.json")
21/12/16 00:31:08 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
business.printSchema()
print(business.count())
root |-- address: string (nullable = true) |-- attributes: struct (nullable = true) | |-- AcceptsInsurance: string (nullable = true) | |-- AgesAllowed: string (nullable = true) | |-- Alcohol: string (nullable = true) | |-- Ambience: string (nullable = true) | |-- BYOB: string (nullable = true) | |-- BYOBCorkage: string (nullable = true) | |-- BestNights: string (nullable = true) | |-- BikeParking: string (nullable = true) | |-- BusinessAcceptsBitcoin: string (nullable = true) | |-- BusinessAcceptsCreditCards: string (nullable = true) | |-- BusinessParking: string (nullable = true) | |-- ByAppointmentOnly: string (nullable = true) | |-- Caters: string (nullable = true) | |-- CoatCheck: string (nullable = true) | |-- Corkage: string (nullable = true) | |-- DietaryRestrictions: string (nullable = true) | |-- DogsAllowed: string (nullable = true) | |-- DriveThru: string (nullable = true) | |-- GoodForDancing: string (nullable = true) | |-- GoodForKids: string (nullable = true) | |-- GoodForMeal: string (nullable = true) | |-- HairSpecializesIn: string (nullable = true) | |-- HappyHour: string (nullable = true) | |-- HasTV: string (nullable = true) | |-- Music: string (nullable = true) | |-- NoiseLevel: string (nullable = true) | |-- Open24Hours: string (nullable = true) | |-- OutdoorSeating: string (nullable = true) | |-- RestaurantsAttire: string (nullable = true) | |-- RestaurantsCounterService: string (nullable = true) | |-- RestaurantsDelivery: string (nullable = true) | |-- RestaurantsGoodForGroups: string (nullable = true) | |-- RestaurantsPriceRange2: string (nullable = true) | |-- RestaurantsReservations: string (nullable = true) | |-- RestaurantsTableService: string (nullable = true) | |-- RestaurantsTakeOut: string (nullable = true) | |-- Smoking: string (nullable = true) | |-- WheelchairAccessible: string (nullable = true) | |-- WiFi: string (nullable = true) |-- business_id: string (nullable = true) |-- categories: string (nullable = true) |-- city: string (nullable = true) |-- hours: struct (nullable = true) | |-- Friday: string (nullable = true) | |-- Monday: string (nullable = true) | |-- Saturday: string (nullable = true) | |-- Sunday: string (nullable = true) | |-- Thursday: string (nullable = true) | |-- Tuesday: string (nullable = true) | |-- Wednesday: string (nullable = true) |-- is_open: long (nullable = true) |-- latitude: double (nullable = true) |-- longitude: double (nullable = true) |-- name: string (nullable = true) |-- postal_code: string (nullable = true) |-- review_count: long (nullable = true) |-- stars: double (nullable = true) |-- state: string (nullable = true)
[Stage 5:=======> (1 + 7) / 8]
160585
checkin.printSchema()
print(checkin.count())
root |-- business_id: string (nullable = true) |-- date: string (nullable = true)
[Stage 7:====================================> (5 + 3) / 8]
138876
review.printSchema()
print(review.count())
root |-- business_id: string (nullable = true) |-- cool: long (nullable = true) |-- date: string (nullable = true) |-- funny: long (nullable = true) |-- review_id: string (nullable = true) |-- stars: double (nullable = true) |-- text: string (nullable = true) |-- useful: long (nullable = true) |-- user_id: string (nullable = true)
[Stage 9:======================================================> (50 + 2) / 52]
8635403
tip.printSchema()
print(tip.count())
root |-- business_id: string (nullable = true) |-- compliment_count: long (nullable = true) |-- date: string (nullable = true) |-- text: string (nullable = true) |-- user_id: string (nullable = true)
[Stage 11:> (0 + 8) / 8]
1162119
user.printSchema()
print(user.count())
root |-- average_stars: double (nullable = true) |-- compliment_cool: long (nullable = true) |-- compliment_cute: long (nullable = true) |-- compliment_funny: long (nullable = true) |-- compliment_hot: long (nullable = true) |-- compliment_list: long (nullable = true) |-- compliment_more: long (nullable = true) |-- compliment_note: long (nullable = true) |-- compliment_photos: long (nullable = true) |-- compliment_plain: long (nullable = true) |-- compliment_profile: long (nullable = true) |-- compliment_writer: long (nullable = true) |-- cool: long (nullable = true) |-- elite: string (nullable = true) |-- fans: long (nullable = true) |-- friends: string (nullable = true) |-- funny: long (nullable = true) |-- name: string (nullable = true) |-- review_count: long (nullable = true) |-- useful: long (nullable = true) |-- user_id: string (nullable = true) |-- yelping_since: string (nullable = true)
[Stage 13:======================================================> (27 + 1) / 28]
2189457
review_counts = (business
.select("name", "latitude", "longitude", "review_count", "stars", "state")
.sort("stars", "review_count", ascending=False))
review_counts.show()
[Stage 15:=============================> (4 + 4) / 8]
+--------------------+-------------+--------------+------------+-----+-----+ | name| latitude| longitude|review_count|stars|state| +--------------------+-------------+--------------+------------+-----+-----+ |Powell's City of ...| 45.5230969| -122.681325| 4295| 5.0| OR| | Aviva by Kameel|33.7605381455|-84.3865554648| 1532| 5.0| GA| | Stanley Park| 49.2978842| -123.1308093| 939| 5.0| BC| |Ovation Coffee & Tea| 45.532588| -122.6810282| 740| 5.0| OR| | Pedal Bike Tours| 45.521572| -122.6729487| 736| 5.0| OR| | Ciao! Pizza & Pasta| 42.3892206| -71.0408823| 733| 5.0| MA| | 1618 Asian Fusion| 30.2453207| -97.7303193| 641| 5.0| TX| |Lake Travis Zipli...| 30.42903| -97.890062| 584| 5.0| TX| |Brakes To Go - Mo...| 30.2729209| -97.7443863| 577| 5.0| TX| | Jewboy Burgers| 30.3127408| -97.7150455| 516| 5.0| TX| | Swiss Hibiscus| 45.5587705| -122.651453| 510| 5.0| OR| |Peace Frog Specia...| 30.2648272| -97.7341597| 509| 5.0| TX| | Tōv| 45.512267| -122.6323559| 503| 5.0| OR| |Einstein Moving C...| 30.348606| -97.67326| 465| 5.0| TX| |Realty Austin - C...| 30.3154649| -97.7346205| 441| 5.0| TX| | Car Zone| 42.3748556| -71.0937018| 426| 5.0| MA| |Auto Glass Solutions| 30.380665| -97.7237| 411| 5.0| TX| |Always Available ...|30.2481061334|-97.7615793065| 411| 5.0| TX| |T-Loc's Sonora Ho...| 30.3211689| -97.7395055| 410| 5.0| TX| | Stairhopper Movers| 42.3796489| -71.0717834| 410| 5.0| MA| +--------------------+-------------+--------------+------------+-----+-----+ only showing top 20 rows
similar_categories = (business
.select("name", "categories", "state", "stars")
.sort("categories", "state", ascending=False))
similar_categories.show()
+--------------------+--------------------+-----+-----+ | name| categories|state|stars| +--------------------+--------------------+-----+-----+ |Nutria Pond behin...|Zoos, Parks, Acti...| OR| 3.5| | Oregon Zoo|Zoos, Parks, Acti...| OR| 4.0| |Manatee Observati...|Zoos, Local Servi...| FL| 4.5| |Rose and Grace Mi...|Zoos, Event Plann...| TX| 5.0| | Jungle Adventures|Zoos, Active Life...| FL| 4.0| | Franklin Park Zoo| Zoos, Active Life| MA| 3.5| |Maharajah Jungle ...| Zoos, Active Life| FL| 4.5| |Zoom Air Adventur...|Ziplining, Botani...| FL| 4.5| | Orlando Tree Trek|Ziplining, Active...| FL| 4.5| |Grouse Mountain Z...|Ziplining, Active...| BC| 4.0| | LA Fitness|Yoga, Trainers, S...| OH| 2.0| | Cycle Elan|Yoga, Trainers, G...| GA| 5.0| | LA Fitness|Yoga, Trainers, G...| GA| 2.5| | Blueprint Fitness|Yoga, Trainers, G...| GA| 5.0| | Full Circle Yoga|Yoga, Trainers, F...| FL| 5.0| |Gold's Gym Austin...|Yoga, Trainers, F...| TX| 3.0| | Anytime Fitness|Yoga, Trainers, F...| CO| 5.0| | Yoga Yoga|Yoga, Trainers, F...| TX| 4.5| |Angel's Boot Camp...|Yoga, Trainers, B...| OR| 5.0| | Health Yoga Life|Yoga, Trainers, A...| MA| 4.0| +--------------------+--------------------+-----+-----+ only showing top 20 rows
categories = (business
.select("business_id", "categories")
.withColumn("categories", F.explode(F.split(F.col("categories"), ", "))))
categories.show()
+--------------------+--------------------+ | business_id| categories| +--------------------+--------------------+ |6iYb2HFDywm3zjuRg...| Gastropubs| |6iYb2HFDywm3zjuRg...| Food| |6iYb2HFDywm3zjuRg...| Beer Gardens| |6iYb2HFDywm3zjuRg...| Restaurants| |6iYb2HFDywm3zjuRg...| Bars| |6iYb2HFDywm3zjuRg...|American (Traditi...| |6iYb2HFDywm3zjuRg...| Beer Bar| |6iYb2HFDywm3zjuRg...| Nightlife| |6iYb2HFDywm3zjuRg...| Breweries| |tCbdrRPZA0oiIYSmH...| Salad| |tCbdrRPZA0oiIYSmH...| Soup| |tCbdrRPZA0oiIYSmH...| Sandwiches| |tCbdrRPZA0oiIYSmH...| Delis| |tCbdrRPZA0oiIYSmH...| Restaurants| |tCbdrRPZA0oiIYSmH...| Cafes| |tCbdrRPZA0oiIYSmH...| Vegetarian| |bvN78flM8NLprQ1a1...| Antiques| |bvN78flM8NLprQ1a1...| Fashion| |bvN78flM8NLprQ1a1...| Used| |bvN78flM8NLprQ1a1...|Vintage & Consign...| +--------------------+--------------------+ only showing top 20 rows
categories_count = (categories
.groupBy("categories")
.count()
.sort("count", ascending=False))
categories_count.show()
[Stage 19:==================================================> (186 + 9) / 200]
+--------------------+-----+ | categories|count| +--------------------+-----+ | Restaurants|50763| | Food|29469| | Shopping|26205| | Beauty & Spas|16574| | Home Services|16465| | Health & Medical|15102| | Local Services|12192| | Nightlife|11990| | Bars|10741| | Automotive|10119| |Event Planning & ...| 9644| | Active Life| 9231| | Coffee & Tea| 7725| | Sandwiches| 7272| | Fashion| 6599| |American (Traditi...| 6541| | Hair Salons| 5900| | Pizza| 5756| | Hotels & Travel| 5703| | Breakfast & Brunch| 5505| +--------------------+-----+ only showing top 20 rows
categories_df = categories_count.limit(20).toPandas()
categories_histogram = px.histogram(categories_df, y="categories", x="count")
categories_histogram.show()
business_checkin = (business
.join(checkin, "business_id")
.withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "latitude", "longitude", "checkin_count")
.sort("checkin_count", ascending=False))
business_checkin.show()
[Stage 23:==================================================> (7 + 1) / 8]
+--------------------+-------------+---------------+-------------+ | business_id| latitude| longitude|checkin_count| +--------------------+-------------+---------------+-------------+ |7sxYa0-TwWeWGFr5C...|33.6407227072| -84.4276356697| 150005| |EqUqaLJxMDxIqvnma...| 45.58979| -122.595204| 75511| |2PxZ-fICnd432NJHe...|28.4278388741| -81.308182904| 63982| |UlndVD4tezU3FACjZ...| 30.202473| -97.666861| 54657| |vITc1UkWLvG9pVdaO...|42.3622379746| -71.0210756391| 54361| |cyBm7p3D2RAoGlDn-...| 28.417663| -81.581212| 25208| |NvAYQvjLxwGC-kcWz...| 28.374694| -81.549404| 24017| |4CxF8c3MB7VAdY8zF...|45.5226120114|-122.6730836114| 20291| |OPfgKOm_n-ajUo3qj...| 45.5230969| -122.681325| 19941| |m3DeBd0NSbAGDjUOk...|49.1942850587| -123.18076195| 18388| |BOFD5UmhHvCn_XH3_...| 28.370971| -81.519392| 17033| |C_k727ws82eMe9xtJ...| 39.9980574| -82.8838088| 16848| |OQ2oHkcWA8KNC1Lsv...| 45.5230858| -122.64168| 15946| |qeuJgUdcmL5yAweOs...| 45.5246501| -122.6818687| 15451| |Tjgte2R4e-3JWj4aA...| 28.3575294| -81.5582714| 15378| |zioLxtBc9THNS2TOn...| 45.504723| -122.632105| 14934| |5y2zZGIE2a4tuEJ5F...| 30.270557| -97.753177| 14153| |5xS76cdT7THG14H_W...| 28.359719| -81.591313| 14151| |PrsvO1rzkgg6qFizl...|42.3641813865| -71.0542184385| 13360| |o_L9Ss4boqq6ZEF9x...| 45.5289444| -122.6982801| 12202| +--------------------+-------------+---------------+-------------+ only showing top 20 rows
review_dates = (review
.join(business, "business_id")
.select("review_id", "business_id", "date")
.sort("business_id", "date", ascending=False)
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0]))
review_dates.show()
review_count_by_date = (review_dates
.select("month", "year")
.sort("year", "month", ascending=False))
review_count_by_date.groupBy("month").pivot("year").count().sort("month", ascending=True).show()
review_by_year = (review_count_by_date
.groupBy("year").count()
.sort("year"))
review_by_year.show()
+--------------------+--------------------+-------------------+-----+----+ | review_id| business_id| date|month|year| +--------------------+--------------------+-------------------+-----+----+ |RIlyZDUUL7dn-wX9R...|zzzKmD9Mj6WtJwJUh...|2012-01-05 07:25:19| 01|2012| |Beiu8FUpabvKp_tIj...|zzzKmD9Mj6WtJwJUh...|2011-07-06 03:11:19| 07|2011| |NsbPfBB7VcKavo2kP...|zzzKmD9Mj6WtJwJUh...|2011-01-30 07:26:01| 01|2011| |ZLg9JZBm8fti3kdcK...|zzzKmD9Mj6WtJwJUh...|2010-12-26 08:06:25| 12|2010| |_PDg4GKwqstcU4jqS...|zzzKmD9Mj6WtJwJUh...|2010-05-05 21:14:56| 05|2010| |7Me9k1V0KNyqAds43...|zzzKmD9Mj6WtJwJUh...|2010-03-07 20:52:10| 03|2010| |PEgXvQE2ZKGQZWqt5...|zzzKmD9Mj6WtJwJUh...|2009-10-22 19:53:49| 10|2009| |coFCcXcoVnTLwNMct...|zzzKmD9Mj6WtJwJUh...|2009-10-13 08:08:57| 10|2009| |79BNQZEWEcZjZe4G7...|zzxIO4bjWCQrvCJ0O...|2020-06-05 12:17:10| 06|2020| |DWbIic86a-H_8nau6...|zzxIO4bjWCQrvCJ0O...|2020-01-09 21:53:36| 01|2020| |6LVhP-V7QtsjN34Ao...|zzxIO4bjWCQrvCJ0O...|2019-08-15 22:00:46| 08|2019| |h5xb4XzKaFr_hBS1_...|zzxIO4bjWCQrvCJ0O...|2018-05-09 11:46:59| 05|2018| |64mGlTN9U8uh3qLQq...|zzxIO4bjWCQrvCJ0O...|2011-11-02 14:02:07| 11|2011| |K0DGtUbEh9mKHrLh4...|zzwK-TJsCJX5wZrdt...|2016-08-02 16:31:57| 08|2016| |uQzssdiURqnvdQVPd...|zzwK-TJsCJX5wZrdt...|2016-07-24 19:06:03| 07|2016| |xgfrMLg0mBFkbzZyN...|zzwK-TJsCJX5wZrdt...|2016-07-19 04:10:43| 07|2016| |7aAloTeX7KceHKivy...|zzwK-TJsCJX5wZrdt...|2016-07-19 01:17:27| 07|2016| |d6KrSTDkBcRmqBZMh...|zzwK-TJsCJX5wZrdt...|2016-07-18 19:54:03| 07|2016| |tseKwcVWB-PBSLzbK...|zzwK-TJsCJX5wZrdt...|2015-05-06 14:45:58| 05|2015| |nw1N6IVukv-Ez_JTL...|zzwK-TJsCJX5wZrdt...|2014-05-06 16:05:21| 05|2014| +--------------------+--------------------+-------------------+-----+----+ only showing top 20 rows
+-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+ |month|2004|2005|2006|2007| 2008| 2009| 2010| 2011| 2012| 2013| 2014| 2015| 2016| 2017| 2018| 2019| 2020| 2021| +-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+ | 01|null| 1| 863|4758|10197|17799|24147|35023|40899|44374|54145|70979|82320|80022| 85867|86871|76453|44461| | 02|null| 4|1267|4466| 9863|15605|23299|33158|37994|41225|49677|63486|78146|75541| 80386|81578|74612| null| | 03|null| 25|1497|4543|10755|17154|26245|37661|41296|44404|58557|76281|81878|87367| 94662|93600|50136| null| | 04|null| 48|1873|4557|10733|15891|24116|35072|37682|42412|56016|72023|80990|87630| 93447|86074|24036| null| | 05|null| 70|1754|6012|10903|16180|23697|35669|38974|44925|58678|78086|84232|91233| 94852|91534|33660| null| | 06|null| 502|1593|6576|11580|17062|25276|36004|39974|44714|61867|78012|83452|91644| 97191|93595|43657| null| | 07|null|2412|1546|7575|14964|19087|29174|40410|43052|49496|69241|85882|90116|99411|107107|97029|50353| null| | 08|null|2271|2028|6668|17138|21619|32210|41051|42557|53526|71823|89623|86628|93197|100206|94767|54419| null| | 09|null| 320|2101|6367|14139|18689|27601|35003|37909|48239|62774|76155|77204|83352| 86467|83076|51666| null| | 10| 38| 290|2514|6662|13553|19210|28034|35750|39256|49242|63055|77803|77082|85783| 84966|82121|52815| null| | 11| 2| 212|3308|6485|13646|18389|26970|33680|37203|46552|60272|70896|68240|77382| 77851|73277|44667| null| | 12| 12| 284|3475|7247|12965|17112|26814|32711|35645|46631|60014|68303|70239|76995| 81333|74047|45417| null| +-----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+-----+-----+
[Stage 35:======================================================> (51 + 1) / 52]
+----+-------+ |year| count| +----+-------+ |2004| 52| |2005| 6439| |2006| 23819| |2007| 71916| |2008| 150436| |2009| 213797| |2010| 317583| |2011| 431192| |2012| 472441| |2013| 555740| |2014| 726119| |2015| 907529| |2016| 960527| |2017|1029557| |2018|1084335| |2019|1037569| |2020| 601891| |2021| 44461| +----+-------+
review_by_year_pd = review_by_year.toPandas()
fig = px.bar(review_by_year_pd, x="year", y="count", title="Total Reviews Per Year")
fig.show()
checkin_date = (checkin
.join(business, "business_id")
.withColumn("checkin_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "date", "checkin_count")
.sort("checkin_count", ascending=False)
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0]))
checkin_date.show()
checkin_counts_by_year = (checkin_date
.groupBy("year")
.sum("checkin_count")
.withColumnRenamed("sum(checkin_count)", "count")
.sort("year"))
checkin_counts_by_year.show()
+--------------------+--------------------+-------------+-----+----+ | business_id| date|checkin_count|month|year| +--------------------+--------------------+-------------+-----+----+ |7sxYa0-TwWeWGFr5C...|2010-01-17 02:00:...| 150005| 01|2010| |EqUqaLJxMDxIqvnma...|2010-01-16 04:58:...| 75511| 01|2010| |2PxZ-fICnd432NJHe...|2010-01-21 17:52:...| 63982| 01|2010| |UlndVD4tezU3FACjZ...|2010-01-17 13:19:...| 54657| 01|2010| |vITc1UkWLvG9pVdaO...|2010-01-18 23:59:...| 54361| 01|2010| |cyBm7p3D2RAoGlDn-...|2010-01-30 23:31:...| 25208| 01|2010| |NvAYQvjLxwGC-kcWz...|2010-01-16 17:37:...| 24017| 01|2010| |4CxF8c3MB7VAdY8zF...|2010-01-17 23:35:...| 20291| 01|2010| |OPfgKOm_n-ajUo3qj...|2010-01-18 05:12:...| 19941| 01|2010| |m3DeBd0NSbAGDjUOk...|2010-01-19 18:16:...| 18388| 01|2010| |BOFD5UmhHvCn_XH3_...|2010-02-07 02:27:...| 17033| 02|2010| |C_k727ws82eMe9xtJ...|2010-01-18 21:50:...| 16848| 01|2010| |OQ2oHkcWA8KNC1Lsv...|2010-02-13 17:19:...| 15946| 02|2010| |qeuJgUdcmL5yAweOs...|2010-01-16 22:32:...| 15451| 01|2010| |Tjgte2R4e-3JWj4aA...|2010-02-11 18:49:...| 15378| 02|2010| |zioLxtBc9THNS2TOn...|2010-01-17 04:40:...| 14934| 01|2010| |5y2zZGIE2a4tuEJ5F...|2010-01-16 02:57:...| 14153| 01|2010| |5xS76cdT7THG14H_W...|2010-02-21 14:55:...| 14151| 02|2010| |PrsvO1rzkgg6qFizl...|2010-01-18 18:40:...| 13360| 01|2010| |o_L9Ss4boqq6ZEF9x...|2012-04-14 19:15:...| 12202| 04|2012| +--------------------+--------------------+-------------+-----+----+ only showing top 20 rows
[Stage 46:====================================> (5 + 3) / 8]
+----+--------+ |year| count| +----+--------+ |2010|11717005| |2011| 1634183| |2012| 1255754| |2013| 980808| |2014| 902834| |2015| 756902| |2016| 585924| |2017| 392881| |2018| 259931| |2019| 128699| |2020| 26019| |2021| 352| +----+--------+
checkin_counts_by_year_pd = checkin_counts_by_year.toPandas()
fig = px.bar(checkin_counts_by_year_pd, x="year", y="count", title="Total Check-Ins Per Year")
fig.show()
user_reviews_location = (user
.join(review, 'user_id')
.join(business, 'business_id'))
user_reviews_location.select('user_id', 'business_id', 'latitude', 'longitude').show()
[Stage 56:======================================================> (51 + 1) / 52]
+--------------------+--------------------+-------------+--------------+ | user_id| business_id| latitude| longitude| +--------------------+--------------------+-------------+--------------+ |--hJsDxzXZURcLxaL...|Ln-8CbKGZGmF-GCqM...| 33.7787398| -84.409326| |--hJsDxzXZURcLxaL...|QS_ojFa7FpfOVThrs...| 33.923006| -84.3399804| |-0Ji0nOyFe-4yo8BK...|1hJEqPHe65oJmO6sc...| 30.3479823| -97.7353261| |-0KAIqdO-DhrIXcY9...|WyIk1ZWy31Qaa8xUi...| 30.0135114| -97.8635298| |-0KAIqdO-DhrIXcY9...|GtlDQ8GG8ykG1ion3...|30.0127452517|-97.8907751147| |-0KAIqdO-DhrIXcY9...|OUZIlIytYIiqPj2RF...| 30.0169045| -97.8633635| |-0KAIqdO-DhrIXcY9...|_Fs1js9sOEFDJ2Txg...|30.0116336043|-97.8900940009| |-0KAIqdO-DhrIXcY9...|k_JMC4M3BJ4X9pC5X...| 30.0708447| -97.873136| |-0KAIqdO-DhrIXcY9...|9BBnf7fLTs2EIyDFo...| 30.0280744| -97.8765501| |-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...| 30.2288371| -97.788686| |-0KAIqdO-DhrIXcY9...|4-zgy_r5j82FiX2ex...| 30.2288371| -97.788686| |-0KAIqdO-DhrIXcY9...|KY-EXj--9PlSBGhOS...| 30.0130756| -97.8917627| |-0KAIqdO-DhrIXcY9...|N2lTmVjYrQrw1nl9M...| 30.1769379| -97.7904271| |-0atCwlr6SSpYniMw...|juSETXWf57AtH5rh2...| 30.2619868| -97.722894| |-1KKYzibGPyUX-Mwk...|ROa5tRU4lUn1ffu0H...| 33.757101775|-84.3881696716| |-1KKYzibGPyUX-Mwk...|SFqFFIA4Ks2oHfgEA...| 28.440381| -81.4698224| |-25JZ2VjrGZfXR8Ea...|S0QEoRla89uZyI1BC...|28.6069310793| -81.209671868| |-25JZ2VjrGZfXR8Ea...|t5EgwRXG6MZSihdqu...| 28.8114547| -81.2666383| |-25JZ2VjrGZfXR8Ea...|_OvpwIXRqRXkVjCgu...| 30.3954441| -97.7269882| |-25JZ2VjrGZfXR8Ea...|x6SoT6Zg8mrIx1uAi...| 30.2928986| -97.741776| +--------------------+--------------------+-------------+--------------+ only showing top 20 rows
review_counts_pd = review_counts.toPandas()
fig = px.density_mapbox(review_counts_pd, lat='latitude', lon='longitude', z='review_count', title="Review Density Heatmap", radius=10, zoom=2.75,
mapbox_style="stamen-terrain")
fig.show()
business_checkin_pd = business_checkin.toPandas()
fig = px.density_mapbox(business_checkin_pd, lat='latitude', lon='longitude', z='checkin_count', title="Check-In Density Heatmap", radius=10, zoom=2.75,
mapbox_style="stamen-terrain")
fig.show()
fig = px.scatter_mapbox(business_checkin_pd, lat='latitude', lon='longitude', color="checkin_count", size="checkin_count",
color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=3)
fig.show()
# group reviews and checkins
# get count for each business grouped by months
review_checkin = (review
.join(checkin, ["business_id", "date"], "full")
.withColumn("year", F.split(F.col("date"), "-")[0])
.withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "year", "combined_count")
.groupBy(["business_id", "year"])
.sum("combined_count"))
ratings = (review
.withColumn("year", F.split(F.col("date"), "-")[0])
.select("business_id", "year", "stars")
.groupBy(["business_id", "year"])
.mean("stars"))
ratings_reviews = (review_checkin
.join(ratings, ["business_id", "year"], "full"))
ratings_reviews.sort(F.desc("sum(combined_count)")).show()
[Stage 69:=====================================================>(197 + 3) / 200]
+--------------------+----+-------------------+------------------+ | business_id|year|sum(combined_count)| avg(stars)| +--------------------+----+-------------------+------------------+ |7sxYa0-TwWeWGFr5C...|2010| 150157|3.0592105263157894| |EqUqaLJxMDxIqvnma...|2010| 75582| 4.450704225352113| |2PxZ-fICnd432NJHe...|2010| 64063|3.2839506172839505| |UlndVD4tezU3FACjZ...|2010| 54720| 4.111111111111111| |vITc1UkWLvG9pVdaO...|2010| 54437| 3.486842105263158| |cyBm7p3D2RAoGlDn-...|2010| 25267| 4.220338983050848| |NvAYQvjLxwGC-kcWz...|2010| 24095| 4.256410256410256| |4CxF8c3MB7VAdY8zF...|2010| 20795|3.7876984126984126| |OPfgKOm_n-ajUo3qj...|2010| 20176| 4.753191489361702| |m3DeBd0NSbAGDjUOk...|2010| 18432| 4.136363636363637| |BOFD5UmhHvCn_XH3_...|2010| 17060|3.7777777777777777| |C_k727ws82eMe9xtJ...|2010| 16875|3.4444444444444446| |OQ2oHkcWA8KNC1Lsv...|2010| 16146| 4.495| |qeuJgUdcmL5yAweOs...|2010| 15571|3.7916666666666665| |Tjgte2R4e-3JWj4aA...|2010| 15411|3.9393939393939394| |zioLxtBc9THNS2TOn...|2010| 15202| 3.955223880597015| |5y2zZGIE2a4tuEJ5F...|2010| 14265| 4.223214285714286| |5xS76cdT7THG14H_W...|2010| 14201| 4.16| |PrsvO1rzkgg6qFizl...|2010| 13687|4.0672782874617734| |o_L9Ss4boqq6ZEF9x...|2012| 12381|4.4636871508379885| +--------------------+----+-------------------+------------------+ only showing top 20 rows
windowval = (Window.partitionBy('business_id').orderBy('date')
.rangeBetween(Window.unboundedPreceding, 0))
review_count = (review
.withColumn('rev_count', F.count('stars').over(windowval))
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.select('business_id', 'month', 'year', 'stars', 'rev_count')
.groupBy(["business_id", "month", "year"])
.sum("rev_count"))
review_count.show()
[Stage 70:======================================================> (51 + 1) / 52]
+--------------------+-----+----+--------------+ | business_id|month|year|sum(rev_count)| +--------------------+-----+----+--------------+ |-36nnCT71XE0InJXK...| 07|2014| 1| |-36nnCT71XE0InJXK...| 02|2015| 2| |-36nnCT71XE0InJXK...| 07|2015| 3| |-36nnCT71XE0InJXK...| 11|2015| 4| |-36nnCT71XE0InJXK...| 03|2016| 5| |-36nnCT71XE0InJXK...| 09|2016| 6| |-36nnCT71XE0InJXK...| 10|2016| 15| |-36nnCT71XE0InJXK...| 12|2016| 9| |-36nnCT71XE0InJXK...| 02|2017| 33| |-36nnCT71XE0InJXK...| 04|2017| 13| |-36nnCT71XE0InJXK...| 05|2017| 14| |-36nnCT71XE0InJXK...| 06|2017| 31| |-36nnCT71XE0InJXK...| 12|2017| 17| |-36nnCT71XE0InJXK...| 01|2018| 37| |-36nnCT71XE0InJXK...| 06|2018| 41| |-36nnCT71XE0InJXK...| 07|2018| 69| |-36nnCT71XE0InJXK...| 08|2018| 25| |-36nnCT71XE0InJXK...| 09|2018| 26| |-36nnCT71XE0InJXK...| 03|2019| 27| |-36nnCT71XE0InJXK...| 05|2019| 28| +--------------------+-----+----+--------------+ only showing top 20 rows
windowval = (Window.partitionBy('business_id').orderBy('date')
.rangeBetween(Window.unboundedPreceding, 0))
review_sum_stars = (review
.withColumn('running_sum', F.sum('stars').over(windowval))
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.select('business_id', 'month', 'year', 'running_sum')
.groupBy(["business_id", "month", "year"])
.sum("running_sum"))
review_sum_stars.show()
[Stage 72:======================================================> (51 + 1) / 52]
+--------------------+-----+----+----------------+ | business_id|month|year|sum(running_sum)| +--------------------+-----+----+----------------+ |-36nnCT71XE0InJXK...| 07|2014| 1.0| |-36nnCT71XE0InJXK...| 02|2015| 2.0| |-36nnCT71XE0InJXK...| 07|2015| 7.0| |-36nnCT71XE0InJXK...| 11|2015| 8.0| |-36nnCT71XE0InJXK...| 03|2016| 9.0| |-36nnCT71XE0InJXK...| 09|2016| 14.0| |-36nnCT71XE0InJXK...| 10|2016| 35.0| |-36nnCT71XE0InJXK...| 12|2016| 21.0| |-36nnCT71XE0InJXK...| 02|2017| 69.0| |-36nnCT71XE0InJXK...| 04|2017| 29.0| |-36nnCT71XE0InJXK...| 05|2017| 31.0| |-36nnCT71XE0InJXK...| 06|2017| 65.0| |-36nnCT71XE0InJXK...| 12|2017| 34.0| |-36nnCT71XE0InJXK...| 01|2018| 75.0| |-36nnCT71XE0InJXK...| 06|2018| 86.0| |-36nnCT71XE0InJXK...| 07|2018| 142.0| |-36nnCT71XE0InJXK...| 08|2018| 50.0| |-36nnCT71XE0InJXK...| 09|2018| 51.0| |-36nnCT71XE0InJXK...| 03|2019| 52.0| |-36nnCT71XE0InJXK...| 05|2019| 53.0| +--------------------+-----+----+----------------+ only showing top 20 rows
review_count_sum = (review_count
.join(review_sum_stars, ['business_id','month', 'year'], 'full')
.select('business_id', 'month', 'year', 'sum(rev_count)', 'sum(running_sum)')
)
review_count_sum.show()
[Stage 76:====================================================> (193 + 7) / 200]
+--------------------+-----+----+--------------+----------------+ | business_id|month|year|sum(rev_count)|sum(running_sum)| +--------------------+-----+----+--------------+----------------+ |--2mEJ63SC_8_08_j...| 11|2016| 6| 14.0| |--UNNdnHRhsyFUbDg...| 08|2016| 7132| 31341.0| |--xmN4i0Hoqx0pPtr...| 11|2017| 44| 180.0| |-0F6zJ4parKu5WMoh...| 08|2016| 6| 26.0| |-0U6xz5cKY1PB_TJh...| 05|2015| 120| 401.0| |-0XME_3EJtOIyiTaY...| 11|2012| 6| 24.0| |-0iqnv7MjKrgh7Q7b...| 04|2017| 365| 1381.0| |-11j5tUL810_k3Soo...| 11|2019| 21| 68.0| |-1b5PAJFq9L-EXEzP...| 02|2014| 234| 543.0| |-1lP2wOmxWidM6Ge4...| 09|2020| 18| 58.0| |-2CqsOEUICBAkdyu7...| 05|2018| 1| 5.0| |-2MT2xutGx_6EeZHA...| 08|2013| 19| 95.0| |-2S-_MGNtX1FcRaTO...| 11|2016| 8| 26.0| |-2aMFgYft7rED0bst...| 08|2017| 3| 15.0| |-2joeHbqY9TayADes...| 01|2016| 195| 784.0| |-2joeHbqY9TayADes...| 05|2019| 302| 1232.0| |-2joeHbqY9TayADes...| 07|2018| 279| 1141.0| |-384bwLp3FoesKs7c...| 02|2018| 90| 447.0| |-3H3oRyCbLuGHVVdq...| 06|2014| 2| 10.0| |-3VQI-QXI7fr4TWH0...| 10|2016| 757| 2783.0| +--------------------+-----+----+--------------+----------------+ only showing top 20 rows
review_checkin = (review
.join(checkin, ["business_id", "date"], "full")
.withColumn("month", F.split(F.col("date"), "-")[1])
.withColumn("year", F.split(F.col("date"), "-")[0])
.withColumn("combined_count", F.size(F.split(F.col("date"), ",")))
.select("business_id", "month", "year", "combined_count")
.groupBy(["business_id", "month", "year"])
.sum("combined_count"))
review_checkin.sort("year", "month").show()
+--------------------+-----+----+-------------------+ | business_id|month|year|sum(combined_count)| +--------------------+-----+----+-------------------+ |F6Qsav77WqGhstQ1E...| 10|2004| 1| |uSM_sCp22BJWeatTD...| 10|2004| 1| |rdS7hBBeukiX4Led9...| 10|2004| 1| |RThn3_Y6qN8MsqZKW...| 10|2004| 1| |dr9PgB_kTp998OYWi...| 10|2004| 2| |_VkLbBvqBMmDJ9EUO...| 10|2004| 1| |Hnx34LxtfW4XV0gZ1...| 10|2004| 1| |Aigcg-qqohtSmjDci...| 10|2004| 2| |sf04_4kcxi6UkXMUl...| 10|2004| 1| |_irBz9yudNSlDUVAe...| 10|2004| 1| |GPhnm_rcjkbc_b96D...| 10|2004| 1| |VvhGHzjyOdStXblX8...| 10|2004| 1| |CGbuorlhKS88aKwhb...| 10|2004| 1| |0Ov-g-4eTOTSgVDBf...| 10|2004| 1| |pvlM--HZY1a8SqMXi...| 10|2004| 1| |54ElwAyN-o8e4uvOk...| 10|2004| 1| |w7JNKVadLeZoqfpTG...| 10|2004| 1| |bJLeoKF04fHYNoxpa...| 10|2004| 1| |j6PJEamZViym5thvz...| 10|2004| 1| |JdVlUaXM9aZQ68tjg...| 10|2004| 1| +--------------------+-----+----+-------------------+ only showing top 20 rows
enough_data = (review_checkin
.groupBy("business_id")
.count())
enough_data.show()
[Stage 85:=====================================================>(198 + 2) / 200]
+--------------------+-----+ | business_id|count| +--------------------+-----+ |dW4pD2rGTIhwCLCtP...| 85| |6qlWc-OVhTQfcyVi3...| 29| |j2NOygXjaTufrsKFl...| 58| |88kosdu9NbLZnHrio...| 85| |D2bmsnOkBIt72WYxO...| 28| |RhGnmMMEaNm3JwKKD...| 56| |euJD4G-BgXYY5BgFl...| 76| |kh9qwwqUcGwgexuiU...| 33| |FbZLY5XASP9phBySt...| 39| |q_Hw7TE8IXJtaQICG...| 8| |6KGBXOeSJYf9ePdyA...| 161| |8Lnid7N3bcAWY--CI...| 28| |PBfFmFy8C0WOtpHF_...| 148| |u7s0q_yGEihxnsS0m...| 27| |MhYtCVMzARhHpApK4...| 123| |8A_OhuJIr6k0Czdz4...| 36| |CeeO1uNYB0uvTNC6M...| 31| |3KqpiLDAjeeMmZeU-...| 52| |W60O4ast9uAq03n7n...| 153| |r7-YSA7Z8p4KISZ-j...| 132| +--------------------+-----+ only showing top 20 rows
enough_data = (enough_data
.where(enough_data['count'] >= 24))
enough_data.show()
[Stage 90:===================================================> (192 + 8) / 200]
+--------------------+-----+ | business_id|count| +--------------------+-----+ |dW4pD2rGTIhwCLCtP...| 85| |6qlWc-OVhTQfcyVi3...| 29| |j2NOygXjaTufrsKFl...| 58| |88kosdu9NbLZnHrio...| 85| |D2bmsnOkBIt72WYxO...| 28| |RhGnmMMEaNm3JwKKD...| 56| |euJD4G-BgXYY5BgFl...| 76| |kh9qwwqUcGwgexuiU...| 33| |FbZLY5XASP9phBySt...| 39| |6KGBXOeSJYf9ePdyA...| 161| |8Lnid7N3bcAWY--CI...| 28| |PBfFmFy8C0WOtpHF_...| 148| |u7s0q_yGEihxnsS0m...| 27| |MhYtCVMzARhHpApK4...| 123| |8A_OhuJIr6k0Czdz4...| 36| |CeeO1uNYB0uvTNC6M...| 31| |3KqpiLDAjeeMmZeU-...| 52| |W60O4ast9uAq03n7n...| 153| |r7-YSA7Z8p4KISZ-j...| 132| |Ukr4_xiBTUPOJ8ySP...| 38| +--------------------+-----+ only showing top 20 rows
final_df = (review_count_sum
.join(review_checkin, ['business_id', 'month', 'year'], 'inner')
.join(enough_data, 'business_id', 'inner'))
final_df.show()
[Stage 101:===================================================> (194 + 6) / 200]
+--------------------+-----+----+--------------+----------------+-------------------+-----+ | business_id|month|year|sum(rev_count)|sum(running_sum)|sum(combined_count)|count| +--------------------+-----+----+--------------+----------------+-------------------+-----+ |-36nnCT71XE0InJXK...| 05|2019| 28| 53.0| 1| 25| |-36nnCT71XE0InJXK...| 12|2017| 17| 34.0| 1| 25| |-36nnCT71XE0InJXK...| 09|2016| 6| 14.0| 1| 25| |-36nnCT71XE0InJXK...| 03|2016| 5| 9.0| 1| 25| |-36nnCT71XE0InJXK...| 07|2014| 1| 1.0| 1| 25| |-36nnCT71XE0InJXK...| 06|2017| 31| 65.0| 2| 25| |-36nnCT71XE0InJXK...| 03|2020| 30| 59.0| 1| 25| |-36nnCT71XE0InJXK...| 09|2019| 29| 58.0| 1| 25| |-36nnCT71XE0InJXK...| 01|2018| 37| 75.0| 2| 25| |-36nnCT71XE0InJXK...| 02|2017| 33| 69.0| 3| 25| |-36nnCT71XE0InJXK...| 07|2020| 65| 125.0| 2| 25| |-36nnCT71XE0InJXK...| 05|2017| 14| 31.0| 1| 25| |-36nnCT71XE0InJXK...| 11|2015| 4| 8.0| 1| 25| |-36nnCT71XE0InJXK...| 10|2016| 15| 35.0| 2| 25| |-36nnCT71XE0InJXK...| 08|2018| 25| 50.0| 1| 25| |-36nnCT71XE0InJXK...| 02|2015| 2| 2.0| 1| 25| |-36nnCT71XE0InJXK...| 06|2018| 41| 86.0| 2| 25| |-36nnCT71XE0InJXK...| 09|2018| 26| 51.0| 1| 25| |-36nnCT71XE0InJXK...| 03|2019| 27| 52.0| 1| 25| |-36nnCT71XE0InJXK...| 07|2015| 3| 7.0| 1| 25| +--------------------+-----+----+--------------+----------------+-------------------+-----+ only showing top 20 rows
x1 = final_df.select('sum(rev_count)').rdd.flatMap(lambda x: x).collect()
x2 = final_df.select('sum(running_sum)').rdd.flatMap(lambda x: x).collect()
y = final_df.select('sum(combined_count)').rdd.flatMap(lambda x: x).collect()
x1 = np.array(x1).reshape(-1,1)
x2 = np.array(x2).reshape(-1,1)
xx = [[i[0],j[0]] for i, j in zip(x1,x2)]
reg = LinearRegression().fit(xx, y)
# print(f"linear regression score: {reg.score(x, y)}")
print(f"linear regression score: {reg.score(xx, y)}")
print(f"coefficient: {reg.coef_}")
print(f"y-intercept: {reg.intercept_}")
linear regression score: 0.00013265360198433385 coefficient: [-5.83676050e-05 9.08684594e-05] y-intercept: 6.196444054201265